Automated Test: monitor-incident-refactor-after #329

admin-coderabbit · 2026-02-04T19:45:37Z

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch created pull request.

Summary by CodeRabbit

New Features
- Incident detection now triggers based on configurable failure thresholds for monitors.
- Failure information is captured with context including environment, failure reasons, and associated traces.
- Monitor environment details are included in incident occurrence reports.

…0528) Since we'll be doing more with issue occurrences split out the concept of incidents into it's own logic module, as well as incident_occurrence into it's own module Part of GH-80527

coderabbit-eval · 2026-02-04T19:46:19Z

📝 Walkthrough

Walkthrough

New incident occurrence handling module for monitors that creates and publishes monitor incident events when failure thresholds are exceeded. Introduces monitor environment context builders, failure reason composition, and incident threshold determination logic while refactoring mark_failed to delegate to centralized threshold logic.

Changes

Cohort / File(s)	Summary
Incident Occurrence Logic `src/sentry/monitors/logic/incident_occurrence.py`	New module introducing `create_incident_occurrence()` to build and publish IssueOccurrence events with evidence and event data. Adds `get_failure_reason()` to compose human-readable failure messages from check-in statuses and `get_monitor_environment_context()` to construct monitor context dictionaries. Includes `HUMAN_FAILURE_STATUS_MAP` and `SINGULAR_HUMAN_FAILURE_MAP` for localized status strings.
Incident Threshold Determination `src/sentry/monitors/logic/incidents.py`	New `try_incident_threshold()` function that determines whether to create incident occurrences based on monitor environment status and failure threshold, assembling prior check-ins and managing incident state transitions (OK/ACTIVE → ERROR).
Mark Failed Refactoring `src/sentry/monitors/logic/mark_failed.py`	Delegates core incident handling to `try_incident_threshold()`, removing 260 lines of internal threshold logic, incident creation, and occurrence handling code.
Type Definitions `src/sentry/monitors/types.py`	Adds `SimpleCheckIn` TypedDict with `id`, `date_added`, and `status` fields for lightweight check-in representation.

Sequence Diagram

sequenceDiagram
    participant CheckIn as Check-in Process
    participant MarkFailed as mark_failed()
    participant Threshold as try_incident_threshold()
    participant Occurrence as create_incident_occurrence()
    participant Kafka as Kafka/Event Bus

    CheckIn->>MarkFailed: failed_checkin, failure_issue_threshold
    MarkFailed->>Threshold: delegate check-in + threshold
    alt Threshold Check Passes
        Threshold->>Threshold: assess status & prior check-ins
        Threshold->>Threshold: create/retrieve MonitorIncident
        Threshold->>Occurrence: failed_checkins, incident data
        Occurrence->>Occurrence: build IssueOccurrence w/ evidence
        Occurrence->>Occurrence: construct event_data & contexts
        Occurrence->>Kafka: produce_occurrence_to_kafka()
        Kafka-->>Occurrence: published
    else Threshold Check Fails
        Threshold-->>MarkFailed: return False
    end
    Threshold-->>MarkFailed: return True/False
    MarkFailed-->>CheckIn: processing complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 twitches whiskers with glee
Monitor incidents now flow with grace,
Thresholds checked, occurrences embraced,
From check-in fails to Kafka's stream,
A refactored dream, pristinely clean! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Automated Test: monitor-incident-refactor-after' is vague and does not clearly describe the actual changes made to the codebase.	Replace with a more descriptive title that summarizes the main changes, such as 'Refactor monitor incident handling to use threshold-based logic' or similar.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch monitor-incident-refactor-after

_{Comment @coderabbit eval help to get the list of available commands and usage tips.}

coderabbit-eval

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@src/sentry/monitors/logic/incident_occurrence.py`:
- Around line 159-171: The function get_monitor_environment_context modifies a
local copy named config (from monitor_environment.monitor.config.copy()) but
returns the original monitor_environment.monitor.config instead of that modified
copy; update the returned dict to use the local config variable (config) so that
the adjusted "schedule_type" (and any other edits) are included; locate
get_monitor_environment_context and replace the "config":
monitor_environment.monitor.config entry with "config": config.
- Around line 130-156: In get_failure_reason, fix the docstring typo ("humam
readible" → "human readable"), change the generator membership check to "if
checkin['status'] in HUMAN_FAILURE_STATUS_MAP" (drop .keys()), replace
list(status_counts.keys())[0] with next(iter(status_counts)) to satisfy RUF015,
and add an explicit branch that returns a sensible default (e.g., _("A failed
check-in was detected") or similar) when status_counts is empty before building
human_status; keep references to status_counts, HUMAN_FAILURE_STATUS_MAP,
SINGULAR_HUMAN_FAILURE_MAP, and get_text_list when implementing these changes.

In `@src/sentry/monitors/logic/incidents.py`:
- Around line 72-87: When monitor_env.status == MonitorStatus.ERROR,
monitor_env.active_incident may be None; update the logic in the block that
builds previous_checkins and assigns incident = monitor_env.active_incident to
check for a missing incident and return early (or log a warning) instead of
proceeding to fire the signal and return True; specifically, in the code path
that handles MonitorStatus.ERROR (the code referencing previous_checkins and
incident), add an explicit guard that if incident is None then do not create
occurrences, avoid firing the existing signal, and return False (or emit a
warning log) so the function accurately reflects that no occurrence was created.

🧹 Nitpick comments (2)

src/sentry/monitors/logic/incidents.py (2)
27-53: Threshold-1 short-circuit is a nice optimization.

The branching logic is sound. Minor nit: line 52 wraps a list comprehension inside any() — a generator expression avoids materializing the entire list.
Nit: use generator expression
-            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
89-104: Re-fetching check-ins and iterating to create one occurrence per check-in.

Two observations:

N+1-ish pattern: Line 93 re-fetches MonitorCheckIn objects from the DB after already having their data in previous_checkins. For a threshold of 1 this is a single query, but for larger thresholds this round-trip may be avoidable if create_incident_occurrence were adapted to accept SimpleCheckIn instead of a full MonitorCheckIn. Not urgent given typical threshold sizes.

Ordering not guaranteed: The queryset on line 93 uses filter(id__in=...) without an order_by, so iteration order may differ from the chronological order in previous_checkins. If occurrence creation order matters (e.g., for trace linkage), consider adding .order_by("date_added").
Add explicit ordering
-        checkins = MonitorCheckIn.objects.filter(id__in=[c["id"] for c in previous_checkins])
+        checkins = MonitorCheckIn.objects.filter(
+            id__in=[c["id"] for c in previous_checkins]
+        ).order_by("date_added")

coderabbit-eval · 2026-02-04T19:53:31Z

src/sentry/monitors/logic/incident_occurrence.py

+def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
+    """
+    Builds a humam readible string from a list of failed check-ins.
+
+    "3 missed check-ins detected"
+    "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
+    "A failed check-in was detected"
+    """
+
+    status_counts = Counter(
+        checkin["status"]
+        for checkin in failed_checkins
+        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
+    )
+
+    if sum(status_counts.values()) == 1:
+        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+
+    human_status = get_text_list(
+        [
+            "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
+            for status, count in status_counts.items()
+        ],
+        last_word=_("and"),
+    )
+
+    return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}


⚠️ Potential issue | 🟡 Minor

Edge case & minor issues in get_failure_reason.

Empty status_counts: If all check-in statuses are outside HUMAN_FAILURE_STATUS_MAP, sum(status_counts.values()) is 0, which falls through to line 148 and produces a garbled message like " check-ins detected". Consider handling this case explicitly.

Typo (line 132): "humam readible" → "human readable".

Ruff RUF015 (line 146): list(status_counts.keys())[0] → next(iter(status_counts)).

Redundant .keys() (line 142): in HUMAN_FAILURE_STATUS_MAP.keys() → in HUMAN_FAILURE_STATUS_MAP.

Proposed fixes

def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]): """ - Builds a humam readible string from a list of failed check-ins. + Builds a human readable string from a list of failed check-ins. "3 missed check-ins detected" "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected" @@ -139,12 +139,15 @@ status_counts = Counter( checkin["status"] for checkin in failed_checkins - if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() + if checkin["status"] in HUMAN_FAILURE_STATUS_MAP ) - if sum(status_counts.values()) == 1: - return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]] + if not status_counts: + return _("A check-in failure was detected") + + if sum(status_counts.values()) == 1: + return SINGULAR_HUMAN_FAILURE_MAP[next(iter(status_counts))]

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 146-146: Prefer next(iter(status_counts.keys())) over single element slice

Replace with next(iter(status_counts.keys()))

(RUF015)

🤖 Prompt for AI Agents

In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 130 - 156, In get_failure_reason, fix the docstring typo ("humam readible" → "human readable"), change the generator membership check to "if checkin['status'] in HUMAN_FAILURE_STATUS_MAP" (drop .keys()), replace list(status_counts.keys())[0] with next(iter(status_counts)) to satisfy RUF015, and add an explicit branch that returns a sensible default (e.g., _("A failed check-in was detected") or similar) when status_counts is empty before building human_status; keep references to status_counts, HUMAN_FAILURE_STATUS_MAP, SINGULAR_HUMAN_FAILURE_MAP, and get_text_list when implementing these changes.

coderabbit-eval · 2026-02-04T19:53:31Z

src/sentry/monitors/logic/incident_occurrence.py

+def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
+    config = monitor_environment.monitor.config.copy()
+    if "schedule_type" in config:
+        config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
+
+    return {
+        "id": str(monitor_environment.monitor.guid),
+        "slug": str(monitor_environment.monitor.slug),
+        "name": monitor_environment.monitor.name,
+        "config": monitor_environment.monitor.config,
+        "status": monitor_environment.get_status_display(),
+        "type": monitor_environment.monitor.get_type_display(),
+    }


⚠️ Potential issue | 🔴 Critical

Bug: modified config copy is never returned.

Line 160 creates a local copy of config and line 162 potentially updates schedule_type on that copy, but line 168 returns the original monitor_environment.monitor.config instead of the modified config.

🐛 Proposed fix

return { "id": str(monitor_environment.monitor.guid), "slug": str(monitor_environment.monitor.slug), "name": monitor_environment.monitor.name, - "config": monitor_environment.monitor.config, + "config": config, "status": monitor_environment.get_status_display(), "type": monitor_environment.monitor.get_type_display(), }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def get_monitor_environment_context(monitor_environment: MonitorEnvironment):

config = monitor_environment.monitor.config.copy()

if "schedule_type" in config:

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {

"id": str(monitor_environment.monitor.guid),

"slug": str(monitor_environment.monitor.slug),

"name": monitor_environment.monitor.name,

"config": monitor_environment.monitor.config,

"status": monitor_environment.get_status_display(),

"type": monitor_environment.monitor.get_type_display(),

}

def get_monitor_environment_context(monitor_environment: MonitorEnvironment):

config = monitor_environment.monitor.config.copy()

if "schedule_type" in config:

config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {

"id": str(monitor_environment.monitor.guid),

"slug": str(monitor_environment.monitor.slug),

"name": monitor_environment.monitor.name,

"config": config,

"status": monitor_environment.get_status_display(),

"type": monitor_environment.monitor.get_type_display(),

}

🤖 Prompt for AI Agents

In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 159 - 171, The function get_monitor_environment_context modifies a local copy named config (from monitor_environment.monitor.config.copy()) but returns the original monitor_environment.monitor.config instead of that modified copy; update the returned dict to use the local config variable (config) so that the adjusted "schedule_type" (and any other edits) are included; locate get_monitor_environment_context and replace the "config": monitor_environment.monitor.config entry with "config": config.

coderabbit-eval · 2026-02-04T19:53:31Z

src/sentry/monitors/logic/incidents.py

+    elif monitor_env.status == MonitorStatus.ERROR:
+        # if monitor environment has a failed status, use the failed
+        # check-in and send occurrence
+        previous_checkins = [
+            {
+                "id": failed_checkin.id,
+                "date_added": failed_checkin.date_added,
+                "status": failed_checkin.status,
+            }
+        ]
+
+        # get the active incident from the monitor environment
+        incident = monitor_env.active_incident
+    else:
+        # don't send occurrence for other statuses
+        return False


⚠️ Potential issue | 🟡 Minor

monitor_env.active_incident may be None.

When the monitor is already in ERROR status, line 84 assigns incident = monitor_env.active_incident, which can be None if the incident was resolved between the status check and this line (race), or if data is inconsistent. Line 92 does guard against incident being falsy before creating occurrences, so this is safe — but the signal on line 102 still fires and True is returned even when no incident exists and no occurrence was created. Consider whether that's the intended behavior or if an early return / log warning would be more appropriate.

🤖 Prompt for AI Agents

In `@src/sentry/monitors/logic/incidents.py` around lines 72 - 87, When monitor_env.status == MonitorStatus.ERROR, monitor_env.active_incident may be None; update the logic in the block that builds previous_checkins and assigns incident = monitor_env.active_incident to check for a missing incident and return early (or log a warning) instead of proceeding to fire the signal and return True; specifically, in the code path that handles MonitorStatus.ERROR (the code referencing previous_checkins and incident), add an explicit guard that if incident is None then do not create occurrences, avoid firing the existing signal, and return False (or emit a warning log) so the function accurately reflects that no occurrence was created.

ref(crons): Reorganize incident creation / issue occurrence logic (#8…

c3b32a7

…0528) Since we'll be doing more with issue occurrences split out the concept of incidents into it's own logic module, as well as incident_occurrence into it's own module Part of GH-80527

coderabbit-eval bot closed this Feb 4, 2026

coderabbit-eval bot reviewed Feb 4, 2026

View reviewed changes

Automated Test: monitor-incident-refactor-after #329

Automated Test: monitor-incident-refactor-after #329

admin-coderabbit commented Feb 4, 2026 •

edited by coderabbit-eval bot

Loading

coderabbit-eval bot commented Feb 4, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

coderabbit-eval bot left a comment

coderabbit-eval bot Feb 4, 2026

coderabbit-eval bot Feb 4, 2026

coderabbit-eval bot Feb 4, 2026

Automated Test: monitor-incident-refactor-after #329

Automated Test: monitor-incident-refactor-after #329

Conversation

admin-coderabbit commented Feb 4, 2026 • edited by coderabbit-eval bot Loading

Summary by CodeRabbit

coderabbit-eval bot commented Feb 4, 2026 • edited Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

coderabbit-eval bot left a comment

Choose a reason for hiding this comment

coderabbit-eval bot Feb 4, 2026

Choose a reason for hiding this comment

coderabbit-eval bot Feb 4, 2026

Choose a reason for hiding this comment

coderabbit-eval bot Feb 4, 2026

Choose a reason for hiding this comment

admin-coderabbit commented Feb 4, 2026 •

edited by coderabbit-eval bot

Loading

coderabbit-eval bot commented Feb 4, 2026 •

edited

Loading